mean_poisson_deviance#
mean_poisson_deviance is a regression loss for non-negative targets (typically counts). It is the (weighted) average of the Poisson deviance:
Best value: 0.0 (when y_pred == y_true); lower is better.
Goals
define the metric and its domain
derive the formula from the Poisson likelihood
implement it from scratch in NumPy (and validate against scikit-learn)
build intuition with plots
use it as an optimization objective for Poisson regression (GLM)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import mean_poisson_deviance
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(42)
np.set_printoptions(precision=4, suppress=True)
When should you use it?#
Use Poisson deviance when:
y_true represents counts (\(0,1,2,\dots\)) or other non-negative quantities
the conditional variance grows with the mean (a Poisson-like setting)
you want a loss that matches the Poisson likelihood (GLM / Poisson regression)
Domain constraints
For every sample:
\(y_i \ge 0\)
\(\mu_i > 0\) (predicted mean)
A common way to ensure \(\mu_i>0\) is a log link:
\[\mu_i = \exp(\eta_i),\]
where \(\eta_i\) is an unconstrained linear predictor (e.g. \(\eta_i = x_i^\top \beta\)).
Definition#
For a single observation \(y \ge 0\) and a prediction \(\mu > 0\), the Poisson unit deviance is:
\[d(y, \mu) = 2\left(y \log\frac{y}{\mu} - (y - \mu)\right),\]
with the convention \(y\log(y/\mu) = 0\) when \(y = 0\).
The mean Poisson deviance is the average over samples:
\[D = \frac{1}{n}\sum_{i=1}^{n} d(y_i, \mu_i).\]
Useful special case: for \(y=0\) the deviance is linear in the prediction, \(d(0,\mu) = 2\mu\).
For \(y>0\), writing \(r = \mu/y\) (multiplicative error):
\[d(y, yr) = 2y\,\bigl(r - 1 - \log r\bigr).\]
Properties:
\(d(y,\mu) \ge 0\)
\(d(y,\mu)=0\) iff \(\mu=y\)
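Both properties can be spot-checked numerically. A quick sketch of the formula above on random counts and positive predictions (illustrative, separate from the implementation that follows):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=3.0, size=10_000).astype(float)
mu = rng.gamma(shape=2.0, scale=2.0, size=10_000)

# Unit deviance, with the convention y*log(y/mu) = 0 when y = 0
ylog = np.zeros_like(y)
pos = y > 0
ylog[pos] = y[pos] * np.log(y[pos] / mu[pos])
dev = 2.0 * (ylog - (y - mu))

# d(y, mu) >= 0 for every sample (up to floating-point roundoff)
assert dev.min() >= -1e-12

# d(y, mu) = 0 when mu = y (checked at the positive counts)
dev_at_truth = 2.0 * (y[pos] * np.log(y[pos] / y[pos]) - (y[pos] - y[pos]))
assert np.allclose(dev_at_truth, 0.0)
print("min deviance:", dev.min())
```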
Where does this formula come from?#
For a Poisson model \(y \sim \mathrm{Poisson}(\mu)\), the log-likelihood for one observation is:
\[\ell(\mu; y) = y \log\mu - \mu - \log(y!).\]
The deviance is a log-likelihood ratio between the fitted model and a saturated model that predicts perfectly (\(\mu=y\)):
\[d(y, \mu) = 2\bigl(\ell(y; y) - \ell(\mu; y)\bigr) = 2\left(y\log\frac{y}{\mu} - (y - \mu)\right).\]
So minimizing mean Poisson deviance is equivalent to maximizing the Poisson likelihood (the saturated term depends only on \(y\)).
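A concrete consequence of the likelihood view: for a constant prediction \(\mu\), the deviance-minimizing value is the sample mean, which is exactly the Poisson MLE. A quick grid-search sketch (illustrative, not part of the implementation below):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.poisson(lam=4.0, size=1_000).astype(float)

def mpd_const(mu: float) -> float:
    # Mean Poisson deviance of a constant prediction mu > 0
    pos = y > 0
    term = np.where(pos, y * np.log(np.where(pos, y, 1.0) / mu), 0.0)
    return float(np.mean(2.0 * (term - (y - mu))))

mus = np.linspace(0.5, 10.0, 2_000)
best = mus[np.argmin([mpd_const(m) for m in mus])]
print(best, y.mean())  # grid minimizer lands on the sample mean
assert abs(best - y.mean()) < 1e-2
```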
KL view
The KL divergence between Poisson distributions satisfies:
\[\mathrm{KL}\bigl(\mathrm{Poisson}(y)\,\|\,\mathrm{Poisson}(\mu)\bigr) = y\log\frac{y}{\mu} - (y - \mu),\]
so:
\[d(y,\mu) = 2\,\mathrm{KL}\bigl(\mathrm{Poisson}(y)\,\|\,\mathrm{Poisson}(\mu)\bigr).\]
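The KL identity can be verified numerically. A small check using scipy.stats (assuming SciPy is available, as it is wherever scikit-learn is installed):

```python
import numpy as np
from scipy.stats import poisson

y_val, mu = 4.0, 2.5

# Numerical KL between Poisson(y) and Poisson(mu), truncated far in the tail
k = np.arange(0, 200)
p = poisson.pmf(k, y_val)
q = poisson.pmf(k, mu)
mask = p > 0
kl_numeric = float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Closed form: KL = y log(y/mu) - (y - mu); the unit deviance is twice that
kl_closed = y_val * np.log(y_val / mu) - (y_val - mu)
dev = 2.0 * kl_closed

print(kl_numeric, kl_closed, dev)
assert np.isclose(kl_numeric, kl_closed)
```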
def _to_1d_float(a, name: str) -> np.ndarray:
a = np.asarray(a, dtype=float)
a = a.reshape(-1)
if not np.all(np.isfinite(a)):
raise ValueError(f"{name} must contain only finite values")
return a
def poisson_deviance_per_sample(y_true, y_pred) -> np.ndarray:
# Per-sample Poisson deviance d(y, μ)
y_true = _to_1d_float(y_true, "y_true")
y_pred = _to_1d_float(y_pred, "y_pred")
if y_true.shape != y_pred.shape:
raise ValueError("y_true and y_pred must have the same shape")
if np.any(y_true < 0):
raise ValueError("y_true must be non-negative")
if np.any(y_pred <= 0):
raise ValueError("y_pred must be strictly positive")
dev = np.empty_like(y_true, dtype=float)
mask = y_true > 0
dev[~mask] = 2.0 * y_pred[~mask] # y=0 => d(0, μ) = 2μ
dev[mask] = 2.0 * (
y_true[mask] * np.log(y_true[mask] / y_pred[mask])
- (y_true[mask] - y_pred[mask])
)
return dev
def mean_poisson_deviance_np(y_true, y_pred, sample_weight=None) -> float:
dev = poisson_deviance_per_sample(y_true, y_pred)
if sample_weight is None:
return float(dev.mean())
w = _to_1d_float(sample_weight, "sample_weight")
if w.shape != dev.shape:
raise ValueError("sample_weight must have the same shape as y_true")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
w_sum = float(w.sum())
if w_sum == 0.0:
raise ValueError("sample_weight sum must be positive")
return float(np.sum(w * dev) / w_sum)
def mean_poisson_deviance_grad_mu(y_true, y_pred, sample_weight=None) -> np.ndarray:
# Gradient of the mean deviance wrt μ (elementwise), shape (n,)
y_true = _to_1d_float(y_true, "y_true")
y_pred = _to_1d_float(y_pred, "y_pred")
if y_true.shape != y_pred.shape:
raise ValueError("y_true and y_pred must have the same shape")
if np.any(y_true < 0):
raise ValueError("y_true must be non-negative")
if np.any(y_pred <= 0):
raise ValueError("y_pred must be strictly positive")
# d(y, μ) = 2 [ y log(y/μ) - (y - μ) ]
# ∂/∂μ d(y, μ) = 2 (1 - y/μ)
grad = 2.0 * (1.0 - y_true / y_pred)
if sample_weight is None:
return grad / y_true.size
w = _to_1d_float(sample_weight, "sample_weight")
if w.shape != grad.shape:
raise ValueError("sample_weight must have the same shape as y_true")
w_sum = float(w.sum())
if w_sum == 0.0:
raise ValueError("sample_weight sum must be positive")
return (w * grad) / w_sum
# Compare against scikit-learn
y_true = rng.poisson(lam=3.0, size=200)
y_pred = rng.gamma(shape=2.0, scale=2.0, size=200) # positive
print("sklearn:", mean_poisson_deviance(y_true, y_pred))
print("numpy :", mean_poisson_deviance_np(y_true, y_pred))
w = rng.uniform(0.0, 1.0, size=y_true.size)
print("sklearn (w):", mean_poisson_deviance(y_true, y_pred, sample_weight=w))
print("numpy (w):", mean_poisson_deviance_np(y_true, y_pred, sample_weight=w))
# Tiny gradient sanity check (finite differences)
y_true_s = np.array([0.0, 1.0, 4.0])
y_pred_s = np.array([0.7, 1.2, 3.5])
g_analytical = mean_poisson_deviance_grad_mu(y_true_s, y_pred_s)
eps = 1e-6
base = mean_poisson_deviance_np(y_true_s, y_pred_s)
g_numeric = np.zeros_like(y_pred_s)
for i in range(y_pred_s.size):
y_pred_eps = y_pred_s.copy()
y_pred_eps[i] += eps
g_numeric[i] = (mean_poisson_deviance_np(y_true_s, y_pred_eps) - base) / eps
print("grad analytical:", g_analytical)
print("grad numeric :", g_numeric)
sklearn: 2.606757123122003
numpy : 2.606757123122003
sklearn (w): 2.4106910529758423
numpy (w): 2.4106910529758423
grad analytical: [ 0.6667 0.1111 -0.0952]
grad numeric : [ 0.6667 0.1111 -0.0952]
Intuition: how does the penalty behave?#
Key behaviors to look for in the plots below:
The minimum is at \(\mu=y\).
For \(y>0\), as \(\mu\to 0^+\), the loss diverges (\(\log(y/\mu)\to\infty\)).
For \(y=0\), the deviance becomes linear: \(d(0,\mu)=2\mu\).
In terms of multiplicative error \(r=\mu/y\) (for \(y>0\)), the shape is independent of \(y\):
\[\frac{d(y, yr)}{y} = 2\,(r - 1 - \log r).\]
So the deviance scales exactly linearly with the count: \(d(cy, c\mu) = c\,d(y,\mu)\).
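The linear scaling is exact: the unit deviance is homogeneous of degree one, \(d(cy, c\mu) = c\,d(y,\mu)\). A two-line check (sketch, separate from the implementation above):

```python
import numpy as np

def unit_dev(y: float, mu: float) -> float:
    # d(y, mu) for y > 0
    return 2.0 * (y * np.log(y / mu) - (y - mu))

y, mu, c = 3.0, 5.0, 10.0
assert np.isclose(unit_dev(c * y, c * mu), c * unit_dev(y, mu))
print(unit_dev(c * y, c * mu), c * unit_dev(y, mu))
```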
mu = np.logspace(-3, np.log10(50), 600)
ys = [0, 1, 5, 20]
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=(
"Unit deviance d(y, μ) as a function of μ",
"Shape vs multiplicative error r = μ / y (for y>0)",
),
)
# Left: d(y, μ) vs μ for several y
for y in ys:
y_vec = np.full_like(mu, float(y))
d = poisson_deviance_per_sample(y_vec, mu)
fig.add_trace(go.Scatter(x=mu, y=d, mode="lines", name=f"y={y}"), row=1, col=1)
fig.update_xaxes(title_text="μ (prediction)", type="log", row=1, col=1)
fig.update_yaxes(title_text="d(y, μ)", row=1, col=1)
# Right: normalized deviance shape as a function of r
r = np.logspace(-2, 2, 600)
d_over_y = 2.0 * (-np.log(r) - 1.0 + r) # since d(y, y r) / y
fig.add_trace(go.Scatter(x=r, y=d_over_y, mode="lines", name="d/y (any y>0)"), row=1, col=2)
fig.update_xaxes(title_text="r", type="log", row=1, col=2)
fig.update_yaxes(title_text="d(y, y·r) / y", row=1, col=2)
fig.update_layout(title="Mean Poisson deviance: shape and scaling")
fig.show()
Using it as a loss: Poisson regression (log link)#
Given features \(X\in\mathbb{R}^{n\times p}\) and counts \(y\in\{0,1,2,\dots\}\), Poisson regression models:
\[\mathbb{E}[y_i \mid x_i] = \mu_i = \exp(x_i^\top\beta).\]
We can fit \(\beta\) by minimizing mean Poisson deviance:
\[L(\beta) = \frac{1}{n}\sum_{i=1}^{n} d\bigl(y_i, \exp(x_i^\top\beta)\bigr).\]
Because \(y_i\log y_i\) does not depend on \(\beta\), the gradient is simple. For the unweighted mean:
\[\nabla_\beta L = \frac{2}{n}\,X^\top(\mu - y).\]
(With weights \(w_i\), replace \(\frac{2}{n}\) with \(\frac{2}{\sum_i w_i}\) and multiply \((\mu-y)\) elementwise by \(w\).)
Below is a from-scratch batch gradient descent solver.
def add_intercept(X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
return np.c_[np.ones((X.shape[0], 1)), X]
def poisson_mean_from_eta(eta: np.ndarray, eta_clip: float = 20.0) -> np.ndarray:
# Clip to avoid overflow/underflow in exp in from-scratch code
eta = np.clip(eta, -eta_clip, eta_clip)
return np.exp(eta)
def poisson_regression_loss_and_grad_beta(
X: np.ndarray,
y: np.ndarray,
beta: np.ndarray,
sample_weight=None,
l2: float = 0.0,
eta_clip: float = 20.0,
):
y = _to_1d_float(y, "y")
X = np.asarray(X, dtype=float)
beta = np.asarray(beta, dtype=float).reshape(-1)
if X.shape[0] != y.shape[0]:
raise ValueError("X and y have incompatible shapes")
if X.shape[1] != beta.shape[0]:
raise ValueError("beta has incompatible shape")
if np.any(y < 0):
raise ValueError("y must be non-negative")
eta = X @ beta
mu = poisson_mean_from_eta(eta, eta_clip=eta_clip)
loss = mean_poisson_deviance_np(y, mu, sample_weight=sample_weight)
if sample_weight is None:
grad = (2.0 / X.shape[0]) * (X.T @ (mu - y))
else:
w = _to_1d_float(sample_weight, "sample_weight")
if w.shape[0] != y.shape[0]:
raise ValueError("sample_weight must have shape (n,)")
W = float(w.sum())
if W == 0.0:
raise ValueError("sample_weight sum must be positive")
grad = (2.0 / W) * (X.T @ (w * (mu - y)))
if l2 > 0.0:
# Ridge penalty on slopes (not the intercept)
loss = loss + l2 * float(np.sum(beta[1:] ** 2))
grad = grad.copy()
grad[1:] = grad[1:] + 2.0 * l2 * beta[1:]
return loss, grad
def fit_poisson_regression_gd(
X: np.ndarray,
y: np.ndarray,
sample_weight=None,
l2: float = 0.0,
lr: float = 0.05,
max_iter: int = 4000,
tol: float = 1e-10,
eta_clip: float = 20.0,
):
X = np.asarray(X, dtype=float)
y = _to_1d_float(y, "y")
beta = np.zeros(X.shape[1], dtype=float)
losses = []
for _ in range(max_iter):
loss, grad = poisson_regression_loss_and_grad_beta(
X,
y,
beta,
sample_weight=sample_weight,
l2=l2,
eta_clip=eta_clip,
)
losses.append(loss)
beta_next = beta - lr * grad
if np.linalg.norm(beta_next - beta) <= tol * (1.0 + np.linalg.norm(beta)):
beta = beta_next
break
beta = beta_next
return beta, np.array(losses)
# Synthetic Poisson regression problem
n = 800
x = rng.uniform(-2.5, 2.5, size=n)
X_raw = x.reshape(-1, 1)
# Standardize feature for easier GD tuning
X_mean = X_raw.mean(axis=0)
X_std = X_raw.std(axis=0)
X = (X_raw - X_mean) / X_std
X_i = add_intercept(X)
beta_true = np.array([0.3, 0.7])
eta_true = X_i @ beta_true
mu_true = np.exp(eta_true)
y = rng.poisson(lam=mu_true)
# Train/test split
idx = rng.permutation(n)
tr = idx[: int(0.8 * n)]
te = idx[int(0.8 * n) :]
X_tr, y_tr = X_i[tr], y[tr]
X_te, y_te = X_i[te], y[te]
# Baseline: constant mean
mu_baseline = np.full_like(y_te, y_tr.mean(), dtype=float)
baseline_dev = mean_poisson_deviance_np(y_te, mu_baseline)
# Fit from scratch
beta_hat, losses = fit_poisson_regression_gd(X_tr, y_tr, lr=0.05, max_iter=4000)
mu_hat_te = poisson_mean_from_eta(X_te @ beta_hat)
dev_te = mean_poisson_deviance_np(y_te, mu_hat_te)
print("true beta:", beta_true)
print(" hat beta:", beta_hat)
print(f"baseline MPD (constant mean): {baseline_dev:.4f}")
print(f"test MPD (from-scratch poisson reg): {dev_te:.4f}")
# Compare with scikit-learn's PoissonRegressor (same feature matrix incl. intercept)
from sklearn.linear_model import PoissonRegressor
sk = PoissonRegressor(alpha=0.0, fit_intercept=False, max_iter=1000)
sk.fit(X_tr, y_tr)
mu_sk_te = sk.predict(X_te)
print("sklearn beta:", sk.coef_)
print(f"test MPD (sklearn): {mean_poisson_deviance(y_te, mu_sk_te):.4f}")
true beta: [0.3 0.7]
hat beta: [0.3078 0.6558]
baseline MPD (constant mean): 1.9821
test MPD (from-scratch poisson reg): 1.1982
sklearn beta: [0.3078 0.6558]
test MPD (sklearn): 1.1982
# Visualize optimization + fit
iters = np.arange(len(losses))
x_te = x[te]
mu_true_te = mu_true[te]
order = np.argsort(x_te)
dev_baseline_per = poisson_deviance_per_sample(y_te, mu_baseline)
dev_model_per = poisson_deviance_per_sample(y_te, mu_hat_te)
fig = make_subplots(
rows=2,
cols=2,
subplot_titles=(
"Training loss (mean Poisson deviance)",
"Test: observed y vs predicted mean μ̂",
"Test: y vs x with mean functions",
"Test: per-sample deviance distribution",
),
)
# (1) Loss curve
fig.add_trace(go.Scatter(x=iters, y=losses, mode="lines", name="train loss"), row=1, col=1)
fig.update_xaxes(title_text="iteration", row=1, col=1)
fig.update_yaxes(title_text="mean deviance", type="log", row=1, col=1)
# (2) Predicted mean vs observed
fig.add_trace(
go.Scatter(x=mu_hat_te, y=y_te, mode="markers", name="(μ̂, y)", opacity=0.6),
row=1,
col=2,
)
max_axis = float(max(mu_hat_te.max(), y_te.max()))
fig.add_trace(
go.Scatter(
x=[0, max_axis],
y=[0, max_axis],
mode="lines",
name="y=μ",
line=dict(dash="dash"),
),
row=1,
col=2,
)
fig.update_xaxes(title_text="predicted mean μ̂", row=1, col=2)
fig.update_yaxes(title_text="observed y", row=1, col=2)
# (3) Mean function vs x
fig.add_trace(
go.Scatter(x=x_te, y=y_te, mode="markers", name="y (test)", opacity=0.35),
row=2,
col=1,
)
fig.add_trace(
go.Scatter(x=x_te[order], y=mu_true_te[order], mode="lines", name="μ true"),
row=2,
col=1,
)
fig.add_trace(
go.Scatter(x=x_te[order], y=mu_hat_te[order], mode="lines", name="μ̂ fit"),
row=2,
col=1,
)
fig.update_xaxes(title_text="x", row=2, col=1)
fig.update_yaxes(title_text="count / mean", row=2, col=1)
# (4) Per-sample deviance distribution
fig.add_trace(
go.Histogram(x=dev_baseline_per, name="baseline", opacity=0.55, nbinsx=30),
row=2,
col=2,
)
fig.add_trace(
go.Histogram(x=dev_model_per, name="poisson reg", opacity=0.55, nbinsx=30),
row=2,
col=2,
)
fig.update_xaxes(title_text="d(y, μ)", row=2, col=2)
fig.update_yaxes(title_text="count", row=2, col=2)
fig.update_layout(
title="Fitting Poisson regression by minimizing mean Poisson deviance",
barmode="overlay",
legend=dict(orientation="h", yanchor="bottom", y=-0.18),
)
fig.show()
Pros, cons, and common pitfalls#
Pros
Proper loss for counts: corresponds to a Poisson likelihood / GLM.
Respects positivity: large penalties for predicting near-zero when events occur.
Multiplicative intuition: depends on the ratio \(r=\mu/y\) (for \(y>0\)), not just additive error.
Convenient for optimization: with a log link, the gradient is simple: \(\nabla_\beta L \propto X^\top(\mu-y)\).
Cons / pitfalls
Requires \(y\ge 0\) and \(\mu>0\).
Can be dominated by large counts / outliers (since deviance scales with \(y\)).
Assumes a Poisson-like mean–variance relationship (\(\mathrm{Var}(y\mid x)\approx\mu\)). For over-dispersion, consider Negative Binomial or Tweedie models.
Practical tips
Enforce \(\mu>0\) with a log link (\(\mu = \exp(\eta)\)).
For rates with exposures \(E_i\) (time at risk, population, etc.), use an offset: \(\eta_i = \log E_i + x_i^\top\beta\).
In from-scratch code, clip \(\eta\) before exp to avoid overflow.
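The exposure tip can be sketched with scikit-learn's PoissonRegressor, using the standard equivalence that fitting the rate \(y/E\) with sample_weight \(E\) matches a \(\log E\) offset (the data-generating numbers here are illustrative):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
n = 2_000
x = rng.normal(size=(n, 1))
exposure = rng.uniform(0.5, 5.0, size=n)   # E_i: e.g. time at risk
rate = np.exp(0.2 + 0.6 * x[:, 0])         # events per unit exposure
y = rng.poisson(lam=exposure * rate)       # counts scale with exposure

# Fitting the rate y/E with weight E is equivalent to a log(E) offset
model = PoissonRegressor(alpha=0.0)
model.fit(x, y / exposure, sample_weight=exposure)
print(model.intercept_, model.coef_)       # should be close to (0.2, [0.6])
```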
Exercises#
Add an exposure offset and simulate data where events are proportional to exposure.
Implement mini-batch gradient descent and compare convergence.
Create an over-dispersed dataset (e.g., Gamma–Poisson / Negative Binomial) and compare Poisson deviance against another deviance (e.g., Tweedie).
References#
scikit-learn: sklearn.metrics.mean_poisson_deviance
scikit-learn: sklearn.linear_model.PoissonRegressor
McCullagh & Nelder, Generalized Linear Models